Multimodal Local Perception Bilinear Pooling for Visual Question Answering
Similar resources
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise multiplication or addition, as well as concatenation of the visual a...
Compact Tensor Pooling for Visual Question Answering
Performing high level cognitive tasks requires the integration of feature maps with drastically different structure. In Visual Question Answering (VQA) image descriptors have spatial structures, while lexical inputs inherently follow a temporal sequence. The recently proposed Multimodal Compact Bilinear pooling (MCB) forms the outer products, via count-sketch approximation, of the visual and te...
Bilinear Pooling and Co-Attention Inspired Models for Visual Question Answering
In recent years, open-ended visual question answering has been an area of active research. In this work, we present our exploration of two state-of-art architectures including the Multi-modal Compact Bi-linear Pooling (MCB) and Dynamic Memory Network (DMN) and analysis of the result and performance of the models. We found both models to perform comparably on the VQA v2.0 dataset based on predic...
Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both visual content of images and textual content of questions. To support the VQA task, we need to find good solutions for the following three issues: 1) fine-grained feature representations for both the image and the question; 2) multi-modal feature fusion that is able to capture the complex int...
Multimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation
In state-of-the-art Neural Machine Translation, an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, whe...
Journal
Journal title: IEEE Access
Year: 2018
ISSN: 2169-3536
DOI: 10.1109/access.2018.2873570